The goal of this project was to develop a new framework for studying information
exchanges under constraints on delay and computational complexity, from an
information-theoretic perspective.

The project is based on a key difference between digital communication problems
and statistical inference problems. In communications, the goal is often to
deliver the information in its entirety, such that every bit of the
information is reliably conveyed. In inference problems, by contrast, we are often
interested only in a particular feature of the data, and do not wish to
reconstruct the entire observed data after processing. This is a fundamental
difference: in the latter case, one has to distinguish between different
components of an observation, make sure the relevant or important information
remains intact after processing and transmission, and often discard the rest of the
information. This notion of lossy processing, or information dissipation during
processing, is a concept that is lacking in conventional information-theoretic
analysis.

In the first few years of this project, we developed a new framework, which
we call "linear information coupling". In this setup, we think of any information
exchange as a variation of the posterior distribution of the information-carrying
message. While the variation of the distribution is often a very high-dimensional
object, we define an orthonormal basis in the space of probability distributions. We
choose this basis to correspond to the SVD spectrum of the observation model,
so that we can decompose the calculation of the posterior distribution into
computing a sequence of scores. These scores are ordered and labeled by the
amount of information they carry. With this general structure, instead of
computing the entire posterior distribution of the message, we can compute only a
few, say k, scores from the top of the list, knowing that these score values contain the
maximum amount of useful information among all k-dimensional statistics that can
be extracted from the observations. We call these scores "efficient statistics". They
are not sufficient statistics in the conventional sense, as they do not contain all the
information in the entire observation, but they are the most informative functions one
can compute given the computational complexity that one can afford.
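
As an illustration of the decomposition described above, the following is a minimal Python sketch for a small discrete model whose joint distribution is known. It assumes that the SVD is taken of the matrix with entries P(x,y)/sqrt(P(x)P(y)), which is an assumption on our part rather than a detail stated in this report. The function name `efficient_scores`, the array layout, and the example distribution are hypothetical.

```python
import numpy as np

def efficient_scores(P_xy, k):
    """Return the top-k nontrivial singular values and score functions.

    P_xy : joint pmf of (X, Y), shape (|X|, |Y|), with strictly positive
           marginals (assumed; entries must sum to 1).
    k    : number of score pairs to keep, at most min(|X|, |Y|) - 1.
    """
    P_xy = np.asarray(P_xy, dtype=float)
    if k > min(P_xy.shape) - 1:
        raise ValueError("k must be at most min(|X|, |Y|) - 1")
    p_x = P_xy.sum(axis=1)
    p_y = P_xy.sum(axis=0)

    # Normalized joint matrix (assumed form): B[x, y] = P(x, y) / sqrt(P(x) P(y)).
    B = P_xy / np.sqrt(np.outer(p_x, p_y))
    U, s, Vt = np.linalg.svd(B, full_matrices=False)

    # The leading singular value is always 1, with singular vectors
    # sqrt(p_x) and sqrt(p_y); it carries no information, so skip it.
    f = U[:, 1:k + 1] / np.sqrt(p_x)[:, None]      # scores on X, shape (|X|, k)
    g = Vt[1:k + 1, :].T / np.sqrt(p_y)[:, None]   # scores on Y, shape (|Y|, k)
    return s[1:k + 1], f, g

# Hypothetical 3x3 joint distribution, used only for illustration.
P = np.array([[0.30, 0.03, 0.02],
              [0.03, 0.20, 0.07],
              [0.02, 0.07, 0.26]])
sigma, f, g = efficient_scores(P, k=2)
print("singular values:", sigma)
print("scores on Y:\n", g)
```

Under these assumptions, each column of `f` and `g` has zero mean and unit variance under the corresponding marginal, and the correlation between the i-th pair of scores equals the i-th returned singular value. Keeping only the first k columns corresponds to computing k scores from the top of the ordered list.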

The idea of information coupling has been applied to a variety of problems. These
include new techniques for traditional network communication
problems, as well as inference problems. Some of these results are reported in the
following publications.

Shao-Lun Huang, Anuran Makur, Fabian Kozynski, Lizhong Zheng: Efficient Statistics:
Extracting Information from IID Observations. Allerton Conference 2014

Shao-Lun Huang, Lizhong Zheng: The Linear Information Coupling Problems. CoRR
abs/1406.2834

Shao-Lun Huang, Changho Suh, Lizhong Zheng: Euclidean Information Theory of
Networks. Accepted, IEEE Trans. Info. Theory, August 2015.

Near the end of this project, we made another important finding. In all of our
previous work, we assumed that the noisy observation channel model is precisely
known. This is, however, a problematic assumption in practice. With the help of the
SVD structure, we came up with an algorithm, a generalization of the
Alternating Conditional Expectation (ACE) algorithm, that efficiently estimates the
top singular vectors of the divergence transition matrix (DTM) from training samples
and uses them as the score functions, or efficient
statistics. This puts our result in the general category of dimension reduction and
non-linear feature selection problems.
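
To make the alternating step concrete, the following is a minimal Python sketch of the classical ACE iteration for a single pair of score functions on discrete data. It is not the generalized algorithm developed in this project: extracting several singular vectors at once is not shown. The function name `ace_discrete`, its parameters, and the integer encoding of the samples are assumptions made for illustration.

```python
import numpy as np

def ace_discrete(x, y, n_iter=200, seed=0):
    """Classical ACE for one pair of scores, estimated from paired samples.

    x, y : 1-D arrays of non-negative integer labels of equal length
           (assumed encoding of the two discrete variables).
    Returns (f, g, rho): f[a] is the score of X = a, g[b] the score of
    Y = b, and rho the empirical correlation of f(X) and g(Y).
    """
    x = np.asarray(x)
    y = np.asarray(y)
    nx, ny = x.max() + 1, y.max() + 1
    # Guard against labels that never occur in the sample.
    cnt_x = np.maximum(np.bincount(x, minlength=nx), 1)
    cnt_y = np.maximum(np.bincount(y, minlength=ny), 1)

    def standardize(h, idx):
        # Zero mean and unit variance over the observed samples.
        vals = h[idx]
        return (h - vals.mean()) / vals.std()

    rng = np.random.default_rng(seed)
    g = rng.standard_normal(ny)          # random initial score on Y
    for _ in range(n_iter):
        # f(a) <- empirical E[g(Y) | X = a], then standardize.
        f = standardize(np.bincount(x, weights=g[y], minlength=nx) / cnt_x, x)
        # g(b) <- empirical E[f(X) | Y = b], then standardize.
        g = standardize(np.bincount(y, weights=f[x], minlength=ny) / cnt_y, y)

    rho = float(np.mean(f[x] * g[y]))
    return f, g, rho
```

Each pass replaces one score by the conditional expectation of the other, which amounts to a power iteration on the empirical version of the normalized joint matrix. The sketch never forms or stores the full channel model; it only uses conditional averages computed from the samples.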

Theoretically, the reason this simple algorithm works is related to the concept
of Rényi maximal correlation. In 1959, Rényi proposed a measure of the dependence
between two random variables, obtained by finding a pair of functions, one for each of the
random variables, that are maximally correlated. In our application, one variable
corresponds to the user, the other to the service. Finding the Rényi correlation in
this problem can thus be understood as finding a particular aspect of the services
that can be used to distinguish the users, i.e., to answer the question "which user is
more likely to use a certain portfolio of services?" Our algorithm is in fact a very
efficient way not only to compute the Rényi maximal correlation, but also to find the
best choices of 200 pairs of functions. As we distill everything we know about a user
into a 200-dimensional signature, we inevitably discard some knowledge about this
user; but with our algorithm, it is guaranteed that these are the 200 values that
carry the largest amount of useful information for the purpose of predicting the user's
service preferences.
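
For reference, the standard definition of the Rényi maximal correlation between two random variables X and Y can be written as follows. The symbols f, g, and \rho are notation chosen here and do not appear in this report.

```latex
\rho(X;Y) \;=\; \sup_{f,\,g}\; \mathbb{E}\bigl[f(X)\,g(Y)\bigr]
\quad \text{subject to} \quad
\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0, \qquad
\mathbb{E}\bigl[f(X)^2\bigr] = \mathbb{E}\bigl[g(Y)^2\bigr] = 1 .
```

For finite alphabets, this quantity equals the second largest singular value of the matrix with entries P(x,y)/sqrt(P(x)P(y)), and the maximizing pair (f, g) is obtained from the corresponding singular vectors. This is what links the maximal correlation to the SVD structure used throughout the project.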

The conceptual step made in this algorithm is that, instead of estimating the
complete statistical model of the observation channel, we estimate only
one "mode" of it, corresponding to the Rényi maximal correlation and the singular
vector of the DTM with the leading singular value. We show that this requires a
significantly smaller number of training samples, which is the critical issue for most
Big Data problems. At the same time, the resulting statistics are also
information-theoretically optimal, in the sense that they carry the largest amount of information
for the given number of statistics. It is thus quite surprising that our result gives the
optimal tradeoff between three objectives: the inference performance, the number
of statistics used, and the sample complexity of learning these score functions.

There are some further advantages of our algorithm. The most useful one in practice
is its generality. Instead of requiring specific forms of data and
restricting to specific types of score functions, such as linear functions over
real-valued data in PCA, our algorithm can be used to process
information with different types of data, discrete or continuous valued. The score
functions selected by this approach are in general non-linear, which sheds light on
the difficult problem of non-linear feature selection. We can also impose additional
constraints on the form of the score functions in our formulation. For example,
when we apply the algorithm to real-valued data and require the score functions to be linear, we
recover the PCA solution as a special case. Thus, our algorithm can be used to
process and integrate data from different applications, to jointly improve
inference performance.

At the very end of this project, we started reporting these results in publications.
The first of them is

Anuran Makur, Shao-Lun Huang, Fabian Kozynski, Lizhong Zheng, "An Efficient
Algorithm for Information Decomposition and Extraction", Allerton Conference, Oct.
2015

We have applied this algorithm to a number of realistic problems, including the
Netflix problem, a community detection problem with Facebook connection graphs,
and a detection problem with audio signals.